Basic Operations with Beautiful Soup and XPath

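Beautiful Soup is a Python library whose main purpose is extracting data from web pages; BS4 is short for Beautiful Soup version 4.x.
Why choose BS4? It lets you avoid complicated regular expressions (RE) as much as possible, and it also works well together with requests.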

The official tutorial in Chinese is available at this link: [click here]

Another blog post covers the topic in more detail: 静觅 » "Python爬虫利器二之Beautiful Soup的用法"

The code below fetches the titles of the 20 "Hot" articles on the Jianshu (简书) home page, first with BS4 and then with XPath.

The complete code (Python 2) is as follows:

#coding:utf-8

import re
import urllib
from bs4 import BeautifulSoup
import time
from lxml import etree

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

html = getHtml("http://www.jianshu.com")

# html holds the fetched page source; passing a document to the BeautifulSoup constructor returns a document object
soup = BeautifulSoup(html,'html.parser',from_encoding='utf-8')
# find all h4 tags
links = soup.find_all("h4")

for link in links:
    print link


time.sleep(5)



selector = etree.HTML(html)
links = selector.xpath('//h4/a/text()')
for link in links:
    print link
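
For readers on Python 3, the two approaches above can be sketched roughly as follows. This is only a minimal illustration, not the original script: `SAMPLE_HTML`, `titles_with_bs4`, and `titles_with_xpath` are hypothetical names, and the inline markup is an assumed stand-in for the fetched page so that the example runs without network access.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample markup standing in for the fetched page (an assumption).
SAMPLE_HTML = """
<html><body>
  <h4><a href="/p/1">First title</a></h4>
  <h4><a href="/p/2">Second title</a></h4>
</body></html>
"""


def titles_with_bs4(html):
    """Collect the text of every <h4> tag with Beautiful Soup."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() returns only the text content, without the surrounding tags.
    return [h4.get_text(strip=True) for h4 in soup.find_all("h4")]


def titles_with_xpath(html):
    """Collect the text of every <a> inside an <h4> with an lxml XPath query."""
    tree = etree.HTML(html)
    return [text.strip() for text in tree.xpath("//h4/a/text()")]


bs4_titles = titles_with_bs4(SAMPLE_HTML)
xpath_titles = titles_with_xpath(SAMPLE_HTML)
print(bs4_titles)
print(xpath_titles)
```

Both helpers return plain strings, so the two result lists can be compared directly. Note that `find_all("h4")` on its own yields whole tags, while the XPath query with `text()` yields only text nodes.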

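Note: this requires some basic skill at inspecting a page's HTML, which an animated image (taken from the article referenced at the end of this post) summarizes: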

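However, the first method prints garbled characters even though UTF-8 was specified at the start; the cause is still undetermined. As shown in the figure: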
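
One possible explanation, offered here as an assumption that the original post does not confirm: under Python 2, printing a Beautiful Soup tag writes UTF-8 encoded bytes, and a console that expects another encoding (GBK is a common one on Chinese-language Windows) will render those bytes as unrelated characters. The Python 3 sketch below only demonstrates that mismatch on a hypothetical sample string; it does not reproduce the original output.

```python
title = "标题"  # hypothetical sample title (an assumption)

# Bytes as a UTF-8 encoder would emit them.
raw = title.encode("utf-8")

# Decoding with the matching codec recovers the original text.
recovered = raw.decode("utf-8")

# Decoding with a different codec (GBK here, an assumed console encoding)
# yields unrelated characters. errors="replace" avoids an exception on
# byte sequences that GBK does not define.
garbled = raw.decode("gbk", errors="replace")

print(recovered == title)
print(repr(garbled))
```

If this is indeed the cause, printing the tag's text (for example via `get_text()`) rather than the tag itself, or matching the console encoding, would be the things to try.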

Reference: an article by 向右奔跑 [click here]